Agenda

• What is this talk about?

• Why port a cloth simulation to the GPU?

• The first attempts

• A new approach

• The shader - Easy parts - Complex parts

• Optimizing the shader

• The PS4 version

• What you can do & cannot do in compute shader

• Tips & tricks

What is this talk about?

• Cloth simulation ported to the GPU

• For PC DirectX 11, Xbox One and PS4

• This talk is about all that we have learned during this adventure

Why port a cloth simulation to the GPU?

Number of dancers simulated in 5 ms of CPU time:

• Xbox 360: 34

• PS3: 105

Now let's switch to next gen!

• PS3: 5 SPUs @ 3.2 GHz

• PS4: 6 cores @ 1.6 GHz

[Chart: number of dancers in 5 ms of CPU time. Xbox 360: 34, PS3: 105, PS4: value not legible]

Next gen doesn't look sexy!

What is the solution?

[Chart: peak power of Xbox One and PS4]

The first attempts

• NVIDIA CUDA
  + Easy to use
  - Not available on all platforms

• C++ AMP (Accelerated Massive Parallelism, with Microsoft Visual C++)
  + Close to C++
  - Black box: no way to know what's going on

• DirectCompute

The first attempts

One compute shader per simulation step:

• Integrate velocity

• Resolve some constraints

• Resolve collisions

• Resolve some more constraints

• Do some other funny stuff

GDCEUROPE.COM

The first attempts

[Bar chart: CPU vs. GPU, vertical axis from 0% to 140%]

• Too many "Dispatch" calls

• Bottleneck = CPU

The first attempts

• Merge several cloth items to get better performance

• All cloth items must have the same properties

A new approach

• A single huge compute shader to simulate the entire cloth

• Synchronization points inside the shader
  -> A single "Dispatch" instead of 50+

• Simulate several cloth items (up to 32) using a single "Dispatch"

[Bar chart: CPU vs. GPU, with the labels 200% and 160%]

The shader

• 41 .hlsl files

• 3,100 lines of code (+ 800 lines for unit tests & benchmarks)

• Compiled shader code size = 69 KB

The shader - Easy parts

• Thread group: threads 0, 1, 2, 3, 4, 5, ..., 63

• We do the same operation on 64 vertices at a time

• There must be no dependency between the threads

The shader - Easy parts

Every thread reads some global properties to apply (e.g. gravity, wind). Then, in parallel:

• Thread 0: read position of vertex 0 -> compute -> write position of vertex 0

• Thread 1: read position of vertex 1 -> compute -> write position of vertex 1

• ...

• Thread 63: read position of vertex 63 -> compute -> write position of vertex 63

The same sequence is then repeated for vertices 64 to 127.

The shader - Easy parts

When a step also needs a per-vertex property:

• Thread 0: read position of vertex 0 -> read property for vertex 0 -> compute -> write position of vertex 0

• Thread 1: read position of vertex 1 -> read property for vertex 1 -> compute -> write position of vertex 1

• ...

• Thread 63: read position of vertex 63 -> read property for vertex 63 -> compute -> write position of vertex 63

The shader - Easy parts

Threads 0 to 63 read the property for vertices 0 to 63 at the same time.

• Ensure contiguous reads to get good performance

• Coalescing = 1 read instead of 16

• i.e. use Structure of Arrays (SoA) instead of Array of Structures (AoS)
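
The difference between the two layouts can be sketched on the CPU side in C++. This is only an illustration: the vertex fields (a position and an inverse mass) and every name below are assumptions, not taken from the talk. The point is the stride between the same field of two consecutive vertices: 16 bytes with AoS, 4 bytes with SoA, so 64 threads reading indices 0 to 63 of an SoA array touch one contiguous range.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of Structures: the fields of one vertex are adjacent, so the same
// field of two consecutive vertices is sizeof(VertexAoS) bytes apart.
struct VertexAoS {
    float x, y, z;
    float inverseMass;  // hypothetical per-vertex property
};

// Structure of Arrays: each field lives in its own contiguous array, so the
// same field of two consecutive vertices is sizeof(float) bytes apart.
struct VerticesSoA {
    std::vector<float> x, y, z;
    std::vector<float> inverseMass;

    explicit VerticesSoA(std::size_t count)
        : x(count), y(count), z(count), inverseMass(count) {}
};

// Distance in bytes between one field of vertex i and of vertex i + 1.
inline std::size_t strideAoS() { return sizeof(VertexAoS); }
inline std::size_t strideSoA() { return sizeof(float); }

// Convert an AoS array to SoA, as a CPU-side upload step might do.
inline VerticesSoA toSoA(const std::vector<VertexAoS>& in) {
    VerticesSoA out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out.x[i] = in[i].x;
        out.y[i] = in[i].y;
        out.z[i] = in[i].z;
        out.inverseMass[i] = in[i].inverseMass;
    }
    return out;
}
```

On the GPU the same idea applies to the buffers the shader reads: one buffer per property rather than one buffer of full vertex structures.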

The shader - Complex parts

• Binary constraints: each constraint links two vertices (Vertex A and Vertex B)

• A vertex shared by two constraints cannot be updated by two threads at the same time, so the constraints are split into groups, with a synchronization between groups:

  Group 1
  GroupMemoryBarrierWithGroupSync()
  Group 2
  GroupMemoryBarrierWithGroupSync()
  Group 3
  GroupMemoryBarrierWithGroupSync()
  Group 4
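
The slides do not say how the groups are built. One simple possibility, shown here as a C++ sketch run offline on the CPU, is a greedy assignment: each constraint goes into the first group that does not yet touch either of its two vertices. The `Constraint` struct and `assignGroups` function are hypothetical names used for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A binary constraint links two vertices (indices into the vertex array).
struct Constraint {
    std::size_t vertexA;
    std::size_t vertexB;
};

// Returns, for each constraint, the index of the group it was assigned to.
// Greedy rule: take the first group in which neither vertex is already used,
// so that no two constraints of the same group share a vertex.
inline std::vector<std::size_t> assignGroups(
    const std::vector<Constraint>& constraints, std::size_t vertexCount) {
    std::vector<std::size_t> groupOf(constraints.size());
    // usedInGroup[g][v] is true when group g already touches vertex v.
    std::vector<std::vector<bool>> usedInGroup;

    for (std::size_t c = 0; c < constraints.size(); ++c) {
        const Constraint& k = constraints[c];
        std::size_t g = 0;
        while (g < usedInGroup.size() &&
               (usedInGroup[g][k.vertexA] || usedInGroup[g][k.vertexB])) {
            ++g;
        }
        if (g == usedInGroup.size()) {
            usedInGroup.emplace_back(vertexCount, false);  // open a new group
        }
        usedInGroup[g][k.vertexA] = true;
        usedInGroup[g][k.vertexB] = true;
        groupOf[c] = g;
    }
    return groupOf;
}
```

All constraints of one group can then be resolved by independent threads, with a barrier before the next group starts.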


GAME DEVELOPERS CONFERENCE EUROPE 2014, AUGUST 11-13, 2014


The shader - Complex parts

• Collisions: Easy or not?

• Collisions with vertices -> Easy

• Collisions with triangles
  -> Each thread will modify the position of 3 vertices
  -> You have to create groups and add synchronization

Optimizing the shader

• General rule: Bottleneck = memory bandwidth

• Data compression:

            CPU                    GPU
  Vertex    128 bits (4 floats)    64 bits (21:21:21:1)
  Normal    128 bits (4 floats)    32 bits (10:10:10)
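
A 21:21:21:1 layout can be sketched in C++ as follows. The slides give only the bit counts, so everything else here is an assumption: each coordinate is quantized linearly inside a known range `[lo, hi]`, the coordinates are stored from the low bits upward, and the last bit is treated as a generic flag. The 10:10:10 normal would follow the same pattern in 32 bits.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr std::uint64_t kMax21 = (1ull << 21) - 1;  // largest 21-bit value

// Quantize a value from [lo, hi] to a 21-bit unsigned integer (assumed scheme).
inline std::uint64_t quantize21(float value, float lo, float hi) {
    double t = (static_cast<double>(value) - lo) / (static_cast<double>(hi) - lo);
    if (t < 0.0) t = 0.0;
    if (t > 1.0) t = 1.0;
    return static_cast<std::uint64_t>(std::llround(t * static_cast<double>(kMax21)));
}

// Inverse of quantize21, up to the quantization error.
inline float dequantize21(std::uint64_t q, float lo, float hi) {
    double t = static_cast<double>(q) / static_cast<double>(kMax21);
    return static_cast<float>(lo + t * (static_cast<double>(hi) - lo));
}

struct Vertex { float x, y, z; bool flag; };   // 128 bits on the CPU side
struct PackedVertex { std::uint64_t bits; };   // 64 bits on the GPU side

// Layout (assumed): bits 0-20 = x, 21-41 = y, 42-62 = z, bit 63 = flag.
inline PackedVertex pack(const Vertex& v, float lo, float hi) {
    return { quantize21(v.x, lo, hi)
           | (quantize21(v.y, lo, hi) << 21)
           | (quantize21(v.z, lo, hi) << 42)
           | (static_cast<std::uint64_t>(v.flag) << 63) };
}

inline Vertex unpack(PackedVertex p, float lo, float hi) {
    return { dequantize21(p.bits & kMax21, lo, hi),
             dequantize21((p.bits >> 21) & kMax21, lo, hi),
             dequantize21((p.bits >> 42) & kMax21, lo, hi),
             (p.bits >> 63) != 0 };
}
```

With 21 bits per axis, a range of 1 unit is resolved in steps of about 4.8e-7, which is why halving the bandwidth costs little precision when the range is tight.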

Optimizing the shader

• Use Local Data Storage (aka Local Shared Memory)

[Diagram: compute units (12 on Xbox One, 18 on PS4), each with its own LDS, and VRAM]

• Store vertices in Local Data Storage:

  Copy vertices from VRAM to LDS
  Step 1 - Update vertices
  Step 2 - Update vertices
  ...
  Step n - Update vertices
  Copy vertices from LDS to VRAM

Optimizing the shader

• Use bigger thread groups

With 64 threads (0 to 63): Load -> Wait -> Compute -> Load -> Wait -> Compute

With 128 threads (0 to 63 and 64 to 127): while the first 64 threads wait for their load, the other 64 can load or compute

With 256 or 512 threads, we hide most of the latency!

Optimizing the shader

[Diagram: a thread group of 64 threads (0 to 63); the threads past the last real vertex process dummy vertices]

• Dummy vertices = useless work!

[Diagrams: the same layout with 128 threads (0 to 127) and with 256 threads (0 to 255)]
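
The amount of useless work can be estimated with a small C++ helper. This sketch assumes the vertex count is rounded up to a whole multiple of the thread group size, which the slides suggest but do not state; the function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>

// Number of vertices after padding to a whole number of thread-group passes.
inline std::size_t paddedVertexCount(std::size_t vertexCount,
                                     std::size_t groupSize) {
    return (vertexCount + groupSize - 1) / groupSize * groupSize;
}

// Number of dummy vertices, i.e. threads doing useless work.
inline std::size_t dummyVertexCount(std::size_t vertexCount,
                                    std::size_t groupSize) {
    return paddedVertexCount(vertexCount, groupSize) - vertexCount;
}
```

For a 100-vertex cloth, a group of 64 threads wastes 28 slots while a group of 256 wastes 156, so the latency gain of a bigger group has to be weighed against the padding.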

Optimizing the shader

[Chart: one curve per thread group size (64, 128, 256 and 512 threads); horizontal axis: cloth's vertices]

The PS4 version

• Port from HLSL to PSSL

  #ifdef __PSSL__
  #define numthreads                       NUM_THREADS
  #define SV_GroupIndex                    S_GROUP_INDEX
  #define SV_GroupID                       S_GROUP_ID
  #define StructuredBuffer                 RegularBuffer
  #define RWStructuredBuffer               RW_RegularBuffer
  #define ByteAddressBuffer                ByteBuffer
  #define RWByteAddressBuffer              RW_ByteBuffer
  #define GroupMemoryBarrierWithGroupSync  ThreadGroupMemoryBarrierSync
  #define groupshared                      thread_group_memory
  #endif

The PS4 version

[Diagrams: synchronization]

The PS4 version

• On PS4:

  • No implicit synchronization, no implicit buffer duplication
  • You have to manage everything by yourself
  • Potentially better performance because you know when you have to sync or not

The PS4 version

• We use labels to know if a buffer is still in use by the GPU

  • Still used -> Automatically allocate a new buffer
  • "Used" means used by a compute shader or a copy

• We also use labels to know when a compute shader has finished, to copy the results

What you can do in compute shader

[Chart: peak power in Gflops (axis from 0 to 1,800), CPU vs. GPU, for Xbox One and PS4]

What you can do in compute shader

• Using DirectCompute, you can do almost everything in compute shader

• The difficulty is to get good performance

What you can do in compute shader

• Efficient code = you work on 64+ data at a time

  if (threadIndex < 32)
  {
      ...
  }

  if (threadIndex == 0)
  {
      ...
  }

-> This is likely to be the bottleneck

What you can do in compute shader

• Example: collisions

• On the CPU: [diagram not recoverable]

• On the GPU: compute a bounding volume (e.g. Axis-Aligned Bounding Box)
  -> Just doing this can be more costly than computing the collision with all vertices!

What you can do in compute shader

• Compute 64 sub-AABoxes (threads 0 to 63)

• Reduce down to 32 sub-AABoxes (we use only 32 threads for that)

• Reduce down to 16 sub-AABoxes (we use only 16 threads for that)

• Reduce down to 8 sub-AABoxes (we use only 8 threads for that)

• Reduce down to 4 sub-AABoxes (we use only 4 threads for that)

• Reduce down to 2 sub-AABoxes (we use only 2 threads for that)

• Reduce down to 1 AABox (we use a single thread for that)

-> This is ~ as costly as computing the collision with 7 x 64 = 448 vertices!
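
The reduction can be emulated on the CPU to see where the cost comes from. In this C++ sketch, each pass of the outer loop stands for one step that the whole thread group has to wait for, although only `active` threads do any work. The `AABox` struct and function names are hypothetical, and the real shader would synchronize the group between steps.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

struct AABox {
    float min[3];
    float max[3];
};

// Smallest box containing both a and b.
inline AABox mergeBoxes(const AABox& a, const AABox& b) {
    AABox r;
    for (int i = 0; i < 3; ++i) {
        r.min[i] = std::min(a.min[i], b.min[i]);
        r.max[i] = std::max(a.max[i], b.max[i]);
    }
    return r;
}

// Reduces 64 sub-boxes in place; the final box ends up in boxes[0].
// Returns the number of reduction steps (32, 16, 8, 4, 2, 1 active threads).
inline int reduceBoxes(std::array<AABox, 64>& boxes) {
    int steps = 0;
    for (std::size_t active = 32; active >= 1; active /= 2) {
        // On the GPU, only threads with index < active would do this work;
        // the other threads of the group sit idle during the step.
        for (std::size_t t = 0; t < active; ++t) {
            boxes[t] = mergeBoxes(boxes[t], boxes[t + active]);
        }
        ++steps;
    }
    return steps;
}
```

The 6 reduction steps plus the initial computation of the 64 sub-boxes give the 7 steps behind the "7 x 64" figure above.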

What you can do in compute shader

• Atomic functions are available
  -> You can write lock-free thread-safe containers

• Too costly in practice

The brute-force approach is almost always the fastest one

• Bandwidth usage

• Data compression

• Memory coalescing

• LDS usage

What you can do in compute shader

Port an algorithm to the GPU only if you find a way to handle 64+ data at a time 95+% of the time

Sharing code between C++ & hlsl

  #if defined(_WIN32) || defined(_WIN64) || defined(_DURANGO) || defined(__ORBIS__)

  typedef unsigned long uint;

  struct float2 { float x, y; };
  struct float3 { float x, y, z; };
  struct float4 { float x, y, z, w; };

  struct uint2 { uint x, y; };
  struct uint3 { uint x, y, z; };
  struct uint4 { uint x, y, z, w; };

  #endif

Debug buffer

  struct DebugBuffer
  {
      float3 m_Velocity;
      float  m_Weight;
  };

  #ifdef USE_DEBUG_BUFFER
  RWStructuredBuffer<DebugBuffer> g_DebugBuffer : register(u1);
  #endif // USE_DEBUG_BUFFER

Debug buffer

  DebugBuffer *debugBuffer = GetDebugBuffer();


What to put in LDS?

[Flowchart; the only recoverable labels are "LDS" and "Random access?"]

Memory consumption in LDS

• LDS = 64 KB per compute unit

• 1 thread group can access 32 KB
  -> 2 thread groups can run simultaneously on the same compute unit

[Diagram: the 64 KB split as 32 + 32 (two thread groups) or as 21 + 21 + 21 (three thread groups)]

• Less memory used in LDS -> More thread groups can run in parallel
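
The arithmetic behind the 32 + 32 and 21 + 21 + 21 splits is a plain integer division. The C++ sketch below considers the LDS limit only; it is an assumption of this example that nothing else (registers, scheduler limits) caps the number of resident thread groups, and the names are illustrative.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kLdsBytesPerComputeUnit = 64 * 1024;  // 64 KB per compute unit

// Maximum number of thread groups that fit in one compute unit's LDS,
// given how many bytes of LDS each thread group declares.
inline std::size_t maxGroupsPerComputeUnit(std::size_t ldsBytesPerGroup) {
    if (ldsBytesPerGroup == 0) {
        return 0;  // not meaningful here: the LDS is not the limiting factor
    }
    return kLdsBytesPerComputeUnit / ldsBytesPerGroup;
}
```

Dropping from 32 KB to 21 KB per group is enough to fit a third group on the same compute unit.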

Optimizing bank access in LDS?

• LDS is divided into several banks (16 or 32)

• 2 threads accessing the same bank -> Conflict

  • Visible impact on performance on older PC hardware
  • Negligible on Xbox One, PS4 and newer PC hardware

Beware the compiler

  CopyFromVRAMToLDS();

  ReadInputFromLDS();
  DoSomeComputations();
  WriteOutputToLDS();

  ReadInputFromLDS();
  DoSomeComputations();
  WriteOutputToLDS();

  CopyFromLDSToVRAM();

• When profiling, the last copy takes all the time. This doesn't make sense!

• The likely explanation: without the last copy, nothing is written back to VRAM, so the compiler is free to remove the rest of the code as well. The cost attributed to the copy is really the cost of the whole shader.

Optimizing compilation time

  float3 fanBlades[10];
  for (uint i = 0; i < 10; ++i)
  {
      Vertex fanVertex = GetVertexInLDS(neighborFan.m_VertexIndex[i]);
      fanBlades[i] = fanVertex.m_Position - fanCenter.m_Position;
  }

  float3 normalAccumulator = cross(fanBlades[0], fanBlades[1]);
  for (uint j = 0; j < 8; ++j)
  {
      float3 triangleNormal = cross(fanBlades[j+1], fanBlades[j+2]);
      uint isTriangleFilled = neighborFan.m_FilledFlags & (1 << j);
      if (isTriangleFilled) normalAccumulator += triangleNormal;
  }

Compilation time:

• Loop: 19 s

• Manually unrolled: 6 s

Iteration time

• It's really hard to know which code will run the fastest.

• The "best" method:

  • Write 10 versions of your feature.
  • Test them.
  • Keep the fastest one.

• A fast iteration time really helps

Bonus: final performance

Next gen can be sexy after all!

[Bar chart comparing Xbox 360, PS3, PS4 CPU, Xbox One CPU, PS4 GPU and Xbox One GPU]

PS4 - 2 ms of GPU time - 640 dancers

Thank you!

Questions?